feat(server): --chat-template-file flag for Jinja chat templates #248
1 issue found across 7 files
Thanks @cubic-dev-ai for catching this. The issue is addressed in fa86c97c (or whatever the new commit hash is, see latest push). The CLI parser now matches the help text: a 0-byte
@sanastasiou Nice, that looks right. The commit message and code both match the behavior now: zero-byte file → warning + empty
Adds a `--chat-template-file PATH` CLI flag to dflash_server that loads a
Jinja chat template from disk and uses it to render the prompt, overriding
the hardcoded QWEN3 / LAGUNA renderer in chat_template.cpp.
Why
---
The existing hardcoded Qwen3.5 ChatML template + tool preamble is
adequate for plain chat but it ships with one specific way of telling
the model how to emit tool calls (the `<tool_call><function=NAME>` XML
format). Real-world Qwen3.6 deployments need template flexibility:
* Community-fine-tuned variants of Qwen3.6 (e.g. froggeric's
"Qwen-Fixed-Chat-Templates") publish their own .jinja files. Without
--chat-template-file the server can't use them.
* Agentic clients like claude-agent-sdk send tool definitions in
Anthropic shape and expect the model to emit tool calls that the
server's tool_parser can lift back into Anthropic tool_use blocks.
Different templates give the model different XML-format instructions,
which directly affects how reliably the model emits well-formed
`<tool_call>...</tool_call>` blocks across long, tool-heavy contexts.
* llama.cpp ships ~50 reference templates in models/templates/*.jinja
— most users will want to point at one of those rather than write
their own hardcoded C++ renderer.
This mirrors llama-server's existing `--jinja --chat-template-file`
flow but lives directly in dflash_server.
What
----
1. New `render_chat_template_jinja(template_src, messages, bos, eos,
add_generation_prompt, enable_thinking, tools_json)` in
chat_template.cpp. Mirrors llama.cpp's
common_chat_template_direct_apply_impl: builds a JSON input matching
the field names every Jinja chat template expects (messages, tools,
bos_token, eos_token, add_generation_prompt, enable_thinking),
parses + runs the template, returns the rendered prompt string.
2. Thread-local cache of the most-recently parsed jinja::program keyed
on the literal template source. Steady-state cost is one
runtime::execute() per request — no re-lex/re-parse — without
introducing global mutable state.
3. The 6 jinja sources from `deps/llama.cpp/common/jinja/`
(lexer/parser/runtime/value/string/caps) plus `common/unicode.cpp`
(used by jinja's tojson() helper) are pulled into the dflash_common
static lib. `deps/llama.cpp/common` is added as a PRIVATE include
path. nlohmann_json was already a PUBLIC link dep.
4. New ServerConfig::chat_template_src / chat_template_path fields.
server_main.cpp parses `--chat-template-file PATH`, reads the file
into memory once at startup, logs the load. http_server.cpp's chat
handler routes to render_chat_template_jinja() when the template
source is non-empty, falling back to the hardcoded QWEN3/LAGUNA
render when it's empty.
5. BOS/EOS strings are pulled from `tokenizer_.raw_token(bos_id())` /
`raw_token(eos_id())` rather than decoded — special tokens like
`<|im_start|>` are stored verbatim in the GGUF vocab and the GPT-2
byte-level decode would otherwise produce mojibake.
6. Render failures (lex/parse/runtime/bad tools JSON) throw
std::runtime_error, surfaced as a 500 response on the chat handler.
Verified by
-----------
7 new unit tests in test_server_unit.cpp covering:
- basic message render
- add_generation_prompt off
- tools array injected and accessible via {{ tools[0].name }}
- "[]" tools list correctly treated as empty (no `tools` key in ctx)
- bos_token / eos_token threaded through to template
- empty template_src throws
- malformed tools JSON throws
End-to-end smoke against /v1/messages with the froggeric Qwen3.6
template: a get_weather tool definition + a "what's the weather in
Tokyo" prompt produced a proper Anthropic tool_use block
(`{"type":"tool_use","name":"get_weather","input":{"city":"Tokyo"}}`).
Files
-----
dflash/CMakeLists.txt                 +16   (jinja sources + include path)
dflash/src/server/chat_template.h     +26   (new fn declaration)
dflash/src/server/chat_template.cpp   +109  (impl + thread-local cache)
dflash/src/server/http_server.h       +6    (ServerConfig fields)
dflash/src/server/http_server.cpp     +37   (dispatch in chat handler)
dflash/src/server/server_main.cpp     +31   (CLI flag + file read)
dflash/test/test_server_unit.cpp      +105  (7 jinja unit tests)
The usage text added in the previous commit promised:
--chat-template-file <path> Load a Jinja chat template file.
Overrides the hardcoded Qwen3/Laguna
renderer. Empty or missing falls back
to the hardcoded template.
… but the CLI parser was aborting startup with `return 1` whenever the
file length was <= 0, contradicting the "empty falls back" half of the
promise.
Behavior change: when --chat-template-file points at a 0-byte file we
now log a warning and leave ServerConfig::chat_template_src empty, so
http_server.cpp's chat handler falls through to render_chat_template()
(the hardcoded QWEN3/LAGUNA path) as documented. Non-empty files are
unchanged; short-read errors still abort.
This makes scripted launches resilient to a transient empty template
file (e.g. a half-written sed pipe, a checked-out-but-not-populated
template path) — the server starts and serves with the hardcoded
template instead of refusing to come up.
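The startup behavior described above could be sketched roughly as follows. This is not the code from server_main.cpp: `TemplateConfig` and `load_chat_template` are hypothetical names standing in for the real ServerConfig fields and CLI parsing, and treating an unopenable file as a hard error is an assumption that the commit message does not spell out.

```cpp
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <utility>

// Hypothetical stand-in for the two ServerConfig fields named in this PR.
struct TemplateConfig {
    std::string chat_template_path;
    std::string chat_template_src;  // empty => hardcoded renderer is used
};

// Returns false only on a hard error (file cannot be opened or read).
// A 0-byte file is not an error: it logs a warning and leaves
// chat_template_src empty so the chat handler falls back to the
// hardcoded template.
bool load_chat_template(const std::string& path, TemplateConfig& cfg) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        // Assumption: an unopenable file aborts startup.
        std::cerr << "error: cannot open chat template file: " << path << "\n";
        return false;
    }

    // Read the whole file into memory once.
    std::string src((std::istreambuf_iterator<char>(in)),
                    std::istreambuf_iterator<char>());
    if (in.bad()) {
        std::cerr << "error: failed while reading: " << path << "\n";
        return false;
    }

    cfg.chat_template_path = path;
    if (src.empty()) {
        std::cerr << "warning: " << path
                  << " is empty; using the hardcoded chat template\n";
        cfg.chat_template_src.clear();
        return true;
    }

    cfg.chat_template_src = std::move(src);
    return true;
}
```

A caller would abort startup only when the function returns false, which keeps the empty-file case on the documented fallback path.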
Identified by cubic.
b7ad386 to 8d6ad73
Thanks for the contribution @sanastasiou
@davide221 you're welcome. I was able to solve real coding challenges with these new additions while using claude-code as the harness and qwen as the backend, driven by lucebox. Unfortunately, after a while something apparently fills up (not sure if it's dflash or the context) and it crashes :D. Also, the speculative decode drops to 40% or less when the context is above 50k or so.
Why
The existing hardcoded Qwen3.5 ChatML template + tool preamble in `chat_template.cpp` is adequate for plain chat but ships with one specific way of telling the model how to emit tool calls (the `<tool_call><function=NAME>` XML format). Real Qwen3.6 deployments need template flexibility:
This mirrors `llama-server`'s existing `--jinja --chat-template-file` flow but lives directly in `dflash_server` so users don't have to layer two binaries.
What
New `render_chat_template_jinja(template_src, messages, bos, eos, add_generation_prompt, enable_thinking, tools_json)` in `chat_template.cpp`. Mirrors `llama.cpp`'s `common_chat_template_direct_apply_impl`: builds a JSON input matching the field names every Jinja chat template expects (`messages`, `tools`, `bos_token`, `eos_token`, `add_generation_prompt`, `enable_thinking`), parses + runs the template, returns the rendered prompt string.
Thread-local cache of the most-recently parsed `jinja::program` keyed on the literal template source. Steady-state cost is one `runtime::execute()` per request — no re-lex/re-parse — without introducing global mutable state.
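A minimal sketch of the single-entry thread-local cache idea, for readers who want to see the shape of it. `ParsedProgram`, `ParseFn` and `cached_program` are hypothetical names used only for illustration; the real code caches a `jinja::program`, whose API is not reproduced here.

```cpp
#include <functional>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for the parsed template (jinja::program in the PR).
struct ParsedProgram {
    std::string source;
};

// Hypothetical parse callback: lex + parse a template source.
using ParseFn = std::function<ParsedProgram(const std::string&)>;

// Single-entry cache, one copy per thread, keyed on the literal template
// source. No locks are needed because no state is shared across threads.
// The returned reference stays valid until the same thread calls this
// function again with a different source.
const ParsedProgram& cached_program(const std::string& template_src,
                                    const ParseFn& parse) {
    thread_local std::string cached_src;
    thread_local std::unique_ptr<ParsedProgram> cached;

    if (!cached || cached_src != template_src) {
        // Parse first, so a throwing parse leaves the old entry intact.
        auto fresh = std::make_unique<ParsedProgram>(parse(template_src));
        cached_src = template_src;
        cached = std::move(fresh);
    }
    return *cached;
}
```

Since the server loads one template at startup, a one-slot cache is enough: every request after the first on a given thread hits the cached entry and only pays for execution.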
Build wiring — the 6 jinja sources from `deps/llama.cpp/common/jinja/` (`lexer/parser/runtime/value/string/caps`) plus `common/unicode.cpp` (`common_parse_utf8_codepoint` used by jinja's `tojson()` helper) are added to the `dflash_common` static lib. `deps/llama.cpp/common` is added as a `PRIVATE` include path. `nlohmann_json` is already a `PUBLIC` link dep.
CLI + ServerConfig — `server_main.cpp` parses `--chat-template-file PATH`, reads the file into memory once at startup, stores it on `ServerConfig::chat_template_src` and logs the load. `http_server.cpp`'s chat handler routes to `render_chat_template_jinja()` when the source is non-empty, falling back to the hardcoded QWEN3/LAGUNA render when it's empty.
BOS/EOS handling — pulled from `tokenizer_.raw_token(bos_id())` / `raw_token(eos_id())` rather than `token_text()` — special tokens like `<|im_start|>` are stored verbatim in the GGUF vocab and the GPT-2 byte-level decode would otherwise produce mojibake.
Error handling — lex/parse/runtime/bad-tools-JSON failures throw `std::runtime_error`, surfaced as a 500 response on the chat handler with the underlying error message.
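The dispatch and error path can be sketched as below. This is an illustration under assumptions, not the code in `http_server.cpp`: `RenderOutcome`, `RenderFn` and `render_prompt` are hypothetical names, and the two callbacks stand in for `render_chat_template_jinja()` and the hardcoded renderer.

```cpp
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical result type: error_status == 0 means the render succeeded
// and text holds the prompt; otherwise text holds the error message.
struct RenderOutcome {
    int error_status;
    std::string text;
};

// Hypothetical callback that produces a rendered prompt or throws.
using RenderFn = std::function<std::string()>;

// Pick the Jinja renderer when a template source was loaded, otherwise the
// hardcoded one, and turn any render exception into a 500 with its message.
RenderOutcome render_prompt(const std::string& template_src,
                            const RenderFn& jinja_render,
                            const RenderFn& hardcoded_render) {
    try {
        const RenderFn& render =
            template_src.empty() ? hardcoded_render : jinja_render;
        return {0, render()};
    } catch (const std::exception& e) {
        return {500, std::string("chat template render failed: ") + e.what()};
    }
}
```

Catching `std::exception` rather than only `std::runtime_error` is a choice made for this sketch so that any standard exception from the renderer also ends up as a 500.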
Usage
```bash
./dflash_server /path/to/target.gguf \
--draft /path/to/draft.gguf \
--chat-template-file /path/to/qwen3.6-froggeric.jinja \
--port 18080 ...
```
If `--chat-template-file` is omitted, behavior is identical to today (hardcoded QWEN3/LAGUNA renderer).
Test plan
Files
```
dflash/CMakeLists.txt                 +16   (jinja sources + include path)
dflash/src/server/chat_template.h     +26   (new fn declaration)
dflash/src/server/chat_template.cpp   +109  (impl + thread-local cache)
dflash/src/server/http_server.h       +6    (ServerConfig fields)
dflash/src/server/http_server.cpp     +37   (dispatch in chat handler)
dflash/src/server/server_main.cpp     +31   (CLI flag + file read)
dflash/test/test_server_unit.cpp      +105  (7 jinja unit tests)
```
Design notes / open questions